Linguistics Prague 2022 workshop

Practical worksheet

This workshop will be a practical introduction to ggplot and data visualisation for linguists.

You should have already installed two programs on your computer, R and RStudio; see https://jamesbrandscience.github.io/tutorials/linguistics_prague_2022_workshop/installation_instructions.html#a-quick-tour-of-r.

Installing and loading in packages

The first thing we will need to do is install and load the tidyverse. This will allow us to make our plots with ggplot and do other useful things.

Note, if you have already installed the tidyverse package you do not need to reinstall it, but maybe update it if you have not done so for a while. If you have it installed, just run the library(tidyverse) line.

install.packages("tidyverse")
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.0      ✔ stringr 1.4.1 
✔ readr   2.0.1      ✔ forcats 0.5.1 
Warning: package 'ggplot2' was built under R version 4.1.2
Warning: package 'tibble' was built under R version 4.1.2
Warning: package 'tidyr' was built under R version 4.1.2
Warning: package 'dplyr' was built under R version 4.1.2
Warning: package 'stringr' was built under R version 4.1.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading in the data

Next, we need to load in the data to use for our visualisations.

This can be done by running the following code. You should then see an R object called data_exp appear in the top right box (the Environment pane in RStudio).

data_exp <- read.csv(url("https://jamesbrandscience.github.io/tutorials/linguistics_prague_2022_workshop/data/workshop_data.csv"))
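
Before plotting, it is worth taking a quick look at what was loaded. The sketch below uses a small made-up data frame, toy_data, so that it runs without an internet connection. The column names participant, x and y are the ones used later in this worksheet, but the values and participant labels are invented. With the real data you would run the same functions on data_exp.

```r
# Hypothetical stand-in for data_exp: same column names, invented values
set.seed(1)
toy_data <- data.frame(
  participant = rep(c("p01", "p02", "p03"), each = 4),
  x = round(runif(12, min = 0, max = 100), 1),
  y = round(runif(12, min = 0, max = 100), 1)
)

head(toy_data)     # first six rows
str(toy_data)      # column types and a preview of the values
summary(toy_data)  # range, mean and quartiles of each numeric column
```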

ggplot basics

For the remainder of the session, we will focus on practically building visualisations using the ggplot2 package. As with many things in R, there is a whole lot more you can learn to do once you have the basics, so today we will focus on building up your practical knowledge of a number of ggplot2's most commonly used features.

Your first plot

Most of the time when you are making a ggplot visualisation, you will begin by calling ggplot()…

ggplot()

Note this is a useless plot…

It is no surprise that this plot is just blank. As you will have noticed, R relies on you to write useful lines of code in order to get what you want.

Q. Try retyping the ggplot() function again, but this time press the tab button when you are inside the brackets. What do you see?

Tip. If you are not sure about any function in R, you can type ? followed by the function name (e.g. ?ggplot) and run that line of code to get a help page. Use this if you want more information about any of the different things we cover later on too.

Intuitively, one of the most important parts of a ggplot is some data to plot, which is supplied through the data argument.

Let’s try filling in this argument…

ggplot(data = data_exp)

Still a blank plot…

The next thing our plot needs is some information about what to plot from the dataset.

Let’s try adding some more useful information…

Let’s try to plot the x and y variables, so that x is on the x axis and y is on the y axis.

This can be done by adding what ggplot calls an aesthetic, or aes() in code form…

Q. Add in an aes() argument to your ggplot(data = data_exp) code, specifying inside the brackets that x = x, y = y

ggplot(data = data_exp, aes(x = x, y = y))

Great, we now have some sort of plot, not a very useful one still, but the very very very minimum amount of information has been provided to get something out.

Let’s store what we have as an object. We can do this by adding an assignment to our code…

my_plot <- ggplot(data = data_exp, aes(x = x, y = y))

Now, if we want to see our plot again, all we have to do is run the code my_plot

my_plot

Introducing layers and geoms

Notice that there is no real data being shown though, that is because we have not added any information about this part of the plot. There are lots of ways to visualise our x and y variables, but we need to specify this in our code…

The nice thing about ggplot is that it works through layers, which is a logical way to build up your plot. Imagine an artist working on a painting of a landscape…

  • First, they have a blank canvas
  • Second, they add in a background layer
  • Third, a layer of background features
  • Fourth, a layer of salient features
  • Finally, being artistic and making it look a bit fancy

This is similar to how most visualisations are also designed, building up layer by layer.

ggplot predominantly adds in these useful layers through what it calls geoms, or geometric objects. There are lots of geoms available in ggplot. They will normally have useful default settings, so you do not always have to specify every detail about your layer, as ggplot will do most of the work for you. We will return to this later, showing why you might want to modify none, some or all of the defaults to get your visualisation looking just the way you want.

In the next few sections we will introduce a few of the most useful geoms that ggplot has available, highlighting the different ways you can visualise different types of data.

Let’s try adding some points to our plot that will plot the locations of the x and y values.

We can do this using a geom_point

my_plot +
  geom_point()
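
Because the plot is stored as an object, adding a layer with + creates a new plot and leaves the stored one unchanged, so you can try out different geoms on the same base. A minimal sketch, using an invented data frame toy_data in place of data_exp (the names base_plot and with_points are only for illustration):

```r
library(ggplot2)

# Hypothetical stand-in for data_exp (invented values)
toy_data <- data.frame(x = c(10, 40, 70), y = c(20, 60, 30))

base_plot   <- ggplot(data = toy_data, aes(x = x, y = y))
with_points <- base_plot + geom_point()

length(base_plot$layers)    # 0: the stored plot still has no geoms
length(with_points$layers)  # 1: the new plot has the geom_point layer
```

If you want to keep the layer, you need to assign the result, as in with_points above.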

Within a geom you can edit other aspects that you might want to change, such as the size or transparency.

my_plot +
  geom_point(size = 0.5, alpha = 0.5)

But remember, if you want to change things that are related directly to the dataset, you have to change them in the aes()

Let’s make the points have colour, based on the participant column

ggplot(data = data_exp, aes(x = x, y = y, colour = participant)) +
  geom_point(size = 0.5, alpha = 0.5)
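
The difference between mapping and setting can be sketched side by side. In the example below (invented data, hypothetical object names), the first plot maps colour to a column inside aes(), while the second sets one fixed colour for all points outside aes().

```r
library(ggplot2)

# Hypothetical stand-in for data_exp (invented values)
toy_data <- data.frame(
  participant = rep(c("p01", "p02"), each = 3),
  x = c(10, 20, 30, 40, 50, 60),
  y = c(15, 35, 25, 65, 45, 55)
)

# Mapped: colour depends on a column, so it goes inside aes()
mapped_plot <- ggplot(data = toy_data, aes(x = x, y = y, colour = participant)) +
  geom_point()

# Set: every point gets the same colour, so it goes outside aes()
fixed_plot <- ggplot(data = toy_data, aes(x = x, y = y)) +
  geom_point(colour = "steelblue")
```

Only the mapped version produces a legend, because only there does the colour carry information from the data.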

If you do not like the legend in the plot, there are two ways you can remove it.

  1. Tell geom_point that you do not want the legend, by specifying show.legend = FALSE
ggplot(data = data_exp, aes(x = x, y = y, colour = participant)) +
  geom_point(size = 0.5, alpha = 0.5, show.legend = FALSE)

  2. If you want to remove legends from every geom you have, doing this one geom at a time might be a bit annoying, so instead you can specify this using theme(legend.position = "none")
ggplot(data = data_exp, aes(x = x, y = y, colour = participant)) +
  geom_point(size = 0.5, alpha = 0.5) +
  theme(legend.position = "none")

Creating summary data

We might want to visualise some summary data, such as the mean and SD to provide simple descriptive statistical visualisations. To do this we need to get the data in a suitable format.

First we need to use pivot_longer to get the data into a longer format. This will take the x and y columns and put either x or y as the values of a new column called coord; the actual x and y values will be in a new column called value

Then we use group_by to say that we want to calculate summary data per participant and coord

We use summarise, with the summary functions we want, e.g. mean = mean(value) gives the mean of the value column for each participant and coord

We use ungroup() at the end as the grouping is finished

data_exp_summary <- data_exp %>%
  pivot_longer(x:y, names_to = "coord", values_to = "value") %>%
  group_by(participant, coord) %>%
  summarise(mean = mean(value),
            sd = sd(value)) %>%
  ungroup()
`summarise()` has grouped output by 'participant'. You can override using the
`.groups` argument.
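
The message above mentions the .groups argument. As a small sketch on invented data (toy_data, toy_long and toy_summary are hypothetical names), setting .groups = "drop" inside summarise gives an ungrouped result directly, so the final ungroup() is not needed and the message does not appear.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for data_exp (invented values)
toy_data <- data.frame(
  participant = rep(c("p01", "p02"), each = 2),
  x = c(10, 30, 50, 70),
  y = c(20, 40, 60, 80)
)

# Long format: one row per participant, coord (x or y) and value
toy_long <- toy_data %>%
  pivot_longer(x:y, names_to = "coord", values_to = "value")

# One row per participant and coord, returned without grouping
toy_summary <- toy_long %>%
  group_by(participant, coord) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            .groups = "drop")

toy_summary
```

Here the 4 rows of toy_data become 8 rows in toy_long and 4 rows in toy_summary (2 participants by 2 coords).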

Now we have our summary data, we can use it to make a bar plot.

We can use a geom_col for this

ggplot(data = data_exp_summary, aes(x = participant, y = mean)) +
  geom_col()

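This looks a bit strange. Each participant has two rows in the summary data (the mean of x and the mean of y), and geom_col has stacked them into a single bar. We should have two separate values per participant.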

We can separate the data using facets

We can use facet_wrap. Note that the variable you want to facet by is preceded by a tilde ~

ggplot(data = data_exp_summary, aes(x = participant, y = mean)) +
  geom_col() +
  facet_wrap(~coord)

We can state how many rows or columns we want too

We will use 2 rows to make the visualisation less squashed horizontally

ggplot(data = data_exp_summary, aes(x = participant, y = mean)) +
  geom_col() +
  facet_wrap(~coord, nrow = 2)

Now we should fix the y axis so that it ranges from 0 to 100. We can do this with scale_y_continuous(), setting the limits of the axis to 0 and 100. The limits need to be a vector, which is why they are within a c()

ggplot(data = data_exp_summary, aes(x = participant, y = mean)) +
  geom_col() +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100))

Now add some colour and make sure the legend is not shown. As it is a bar plot, we need to fill the bars, not colour them, so we need to specify fill=participant

ggplot(data = data_exp_summary, aes(x = participant, y = mean, fill = participant)) +
  geom_col() +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none")

We can also change other things within the theme function. For example, if we want to change the axis text on the x axis, we can rotate it at an angle, or change the size etc.

Let’s put the text at a 45 degree angle and justify it to the axis

We do this with axis.text.x = element_text(angle = 45, hjust = 1)

ggplot(data = data_exp_summary, aes(x = participant, y = mean, fill = participant)) +
  geom_col() +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Now we might want to add error bars to our plot, which we should have done when adding the first geom, but we can add it in now.

To do this we will use a geom_errorbar. Note that we specify new aes values within the geom. These are ymin and ymax, as the geom relates only to the y axis values, and we calculate them as the mean minus the sd and the mean plus the sd.

We also specify a width, here 0.2, but see if it looks nicer as 1.

ggplot(data = data_exp_summary, aes(x = participant, y = mean, fill = participant)) +
  geom_col() +
  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2) +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Other types of geoms

Bar plots are not always the nicest way to visualise data, especially as they only present summaries. To show the full distribution of the data instead, we will leave the summary data and return to the raw data data_exp

First we will need to get it into long format again; we will call this data_exp_long

data_exp_long <- data_exp %>%
  pivot_longer(x:y, names_to = "coord", values_to = "value")

We can see some other geoms below

Boxplot

ggplot(data = data_exp_long, aes(x = participant, y = value, fill = participant)) +
  geom_boxplot() +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Violin

ggplot(data = data_exp_long, aes(x = participant, y = value, fill = participant)) +
  geom_violin() +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Half violin

To do this we will need to install another package for it to work, the gghalves package

install.packages("gghalves")

Then load in the package

library(gghalves)

Note that you can specify which side you want the half violin to be on and whether you want quantile lines to be drawn. Here the code puts the half violins on the right (side = "r") and draws 25%, 50% and 75% quantile lines

ggplot(data = data_exp_long, aes(x = participant, y = value, fill = participant)) +
  geom_half_violin(side = "r", draw_quantiles = c(0.25, 0.5, 0.75)) +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Raincloud

We can add in some half points to the previous plot to make it a raincloud plot. These are set to be on the left (side = "l"), along with some size and transparency

ggplot(data = data_exp_long, aes(x = participant, y = value, fill = participant)) +
  geom_half_violin(side = "r", draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_half_point(size = 0.2, alpha = 0.5, side = "l") +
  facet_wrap(~coord, nrow = 2) +
  scale_y_continuous(limits = c(0, 100)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Linear models

We can present the relationship between two different variables, such as a correlation or a linear model, by using a geom_smooth. To do this we need the data in a format where the x and y variables are distinct, so we will go back to our original data data_exp.

Note that I added in two arguments method = "lm", which means it is a linear model fit, and se = FALSE, which removes the standard error shading of the model.

ggplot(data = data_exp, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'

It is normally a good idea to visualise the standard error of our models. Luckily, the default in a geom_smooth is se = TRUE, so we can just remove that argument

ggplot(data = data_exp, aes(x = x, y = y)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

We may also want to change the axis limits (here for the y axis) so that they are on a scale that does not make the effect look visually bigger than it might actually be

ggplot(data = data_exp, aes(x = x, y = y)) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0, 100))
`geom_smooth()` using formula 'y ~ x'
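
One thing to be aware of: setting limits in scale_y_continuous() removes any observations outside those limits before the model is fitted, so the fitted line itself can change. If you only want to zoom the view, coord_cartesian(ylim = ...) is an alternative that leaves the data untouched. A minimal sketch with invented data (toy_data and zoomed_plot are hypothetical names):

```r
library(ggplot2)

# Hypothetical stand-in for data_exp (invented values)
set.seed(1)
toy_data <- data.frame(x = 1:50)
toy_data$y <- 40 + 0.5 * toy_data$x + rnorm(50, sd = 5)

# coord_cartesian() zooms the view without dropping any observations;
# giving the formula explicitly also avoids the geom_smooth() message
zoomed_plot <- ggplot(data = toy_data, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_cartesian(ylim = c(0, 100))

zoomed_plot
```

When all of your data already lie within the limits, as assumed for the 0 to 100 range used in this worksheet, the two approaches should give the same picture.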

Let’s now facet the data by participant, so we can see if this model fit is similar for all participants

ggplot(data = data_exp, aes(x = x, y = y)) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0, 100)) +
  facet_wrap(~participant) +
  theme_bw()
`geom_smooth()` using formula 'y ~ x'

But we may also need to add in the raw data, so we can see whether these models are under or over fitting. We can add some colour too.

Note that it matters which order you put your geoms in. Like a painting, if you start with fine details but add larger geoms later in the code, the details might be covered up and not be easy to see

ggplot(data = data_exp, aes(x = x, y = y, colour = participant)) +
  geom_point(size = 0.2, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0, 100)) +
  facet_wrap(~participant) +
  theme_bw() +
  theme(legend.position = "none")
`geom_smooth()` using formula 'y ~ x'

Saving

You will want to save your visualisations and maybe include them in your publications, to do this we need to do the following

  1. store the plot as an R object, let’s call the above plot participant_data_plot
participant_data_plot <- ggplot(data = data_exp, aes(x = x, y = y, colour = participant)) +
  geom_point(size = 0.2, alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0, 100)) +
  facet_wrap(~participant) +
  theme_bw() +
  theme(legend.position = "none")
  2. Use ggsave to export the plot to your computer. We need to specify the plot, plot = participant_data_plot, and the filename, filename = "participant_data_plot.png". Note that the .png is important, as it tells ggsave (and your computer) that this is a PNG image file
ggsave(plot = participant_data_plot, filename = "participant_data_plot.png")
Saving 8 x 5 in image
`geom_smooth()` using formula 'y ~ x'
  3. There are other defaults that you might want to change, such as the width, height and resolution (dpi). This normally means playing around with different values to get the plot to look right, but dpi = 400 is normally good.
ggsave(plot = participant_data_plot, filename = "participant_data_plot.png", width = 7, height = 7, dpi = 400)
`geom_smooth()` using formula 'y ~ x'
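
By default ggsave reads width and height in inches (hence the "Saving 8 x 5 in image" message above). If your journal gives figure sizes in centimetres, the units argument can be used instead. A minimal sketch with an invented plot; toy_plot and out_file are hypothetical names, and tempfile() is used only so the example does not write into your working directory:

```r
library(ggplot2)

# Hypothetical stand-in plot (invented values)
toy_data <- data.frame(x = c(10, 40, 70), y = c(20, 60, 30))
toy_plot <- ggplot(data = toy_data, aes(x = x, y = y)) +
  geom_point()

# tempfile() gives a throwaway path; in practice use your own file name
out_file <- tempfile(fileext = ".png")

ggsave(filename = out_file, plot = toy_plot,
       width = 16, height = 10, units = "cm", dpi = 400)

file.exists(out_file)  # check that the image was written
```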

Animations

If we want to present dynamic data, we can use animations to cycle through variables, making it easier to observe trends or patterns.

We will need two new packages, gganimate and gifski

install.packages("gganimate")
install.packages("gifski")
library(gganimate)
library(gifski)
Warning: package 'gifski' was built under R version 4.1.2

Now we can make a plot that will act as the basis of the animation, this should be what each frame of the animation will look like

Let’s make a simple x and y scatter plot called animation_plot

animation_plot <- ggplot(data = data_exp, aes(x,y)) +
  geom_point()

Next we have to tell it how we want it to be animated, i.e. which variable it will use to change the frames. This can be done with transition_states. We want to animate based on participant. The transition_length and state_length arguments set the relative length of the transition between states and of the pause at each state.

We will also want to add a title to the animation. This is done with labs(title = "{closest_state}"), where {closest_state} takes the participant value for that frame. There is also some theme customisation of the title so it is easier to read

animation_plot <- animation_plot +
  transition_states(participant, transition_length = 3, state_length = 3) +
  labs(title = "{closest_state}") +
  theme(plot.title = element_text(size=22,hjust = 0.5))

Now we need to animate it

animation_plot <- animate(animation_plot, width = 600, height = 600, res = 100)

You can look at it by just typing the animation object name

animation_plot

Then save it

anim_save(animation = animation_plot, filename = "animation_plot.gif")